Midterm Project: Effects of Air Pollution on Countries

1. Introduction

Motivation

Climate issues are pervasive, but they typically affect low-income communities and developing countries disproportionately. Our group wanted to explore how air pollution has changed over time and how it affects countries differently. Specifically, we wanted to analyze whether a country’s economic and social position increases, decreases, or has no observable impact on the effects of air pollution. In layman’s terms, does air pollution affect underdeveloped countries disproportionately?

Set Up

Before we start, we need to ensure that we have all the relevant libraries installed and imported.

Run these in the console (or only the ones your system is missing) to install the packages needed in addition to the ezids package.

install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")
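To install only what is missing, the list above can be compared against the packages already installed. This is a minimal sketch in base R; the helper name `missing_packages` is our own and not part of any package, and the snippet only prints a suggested command rather than installing anything.

```r
# Packages listed above.
required <- c("tidyverse", "rworldmap", "tmap", "spData", "sf",
              "ggpubr", "dplyr", "knitr", "magrittr")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  # Print the command instead of running it, so nothing is installed silently.
  message("Run: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}
```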

2. Data Sources and Data Wrangling

Data Sources

For our analysis, we will be working with 5 main data sources shown in the table below:

Figure 1: Data Sources
Data Source | Link
Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle Link
GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank Link
United Nations Population and Region Data | United Nations Link
United Nations ISO-alpha3 code | United Nations Link
spData for Map Geometries | spData for Mapping Link

The main variables in our datasets will include:

Figure 2: Key Variables
Feature | Data Type | Unit of Measure | Notes and Assumptions
GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status.
Population Size | Numerical, Continuous | thousands of people | Annual UN estimate.
Deaths due to Air Pollution | Numerical, Continuous | deaths per 100,000 | This is our chosen proxy for measuring the negative effects of air pollution.
Country | Qualitative, Categorical | N/A | 231 countries.
SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification.
Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification.
ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID).
ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID).
M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID).
Year | Numerical, Categorical | N/A | 1990 to 2017.
GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated).

Data Wrangling

While the data from Kaggle are already in a format that can be cleaned directly, the data downloaded from the United Nations required some wrangling first. Mainly, we needed to extract just the country-level data from the Excel workbooks into their own self-contained csv files. Since we only need to do this once, and programming the selection of the specific cells would take significant time, we opted to perform this step in Excel rather than in R. Note that if this were part of a real production data pipeline, we would take the time to program the data extraction, but we would likely choose a different language such as Python, which is a bit more robust for tasks like web scraping and data transformation with Pandas.

UN Data Sample Messy
Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
UN Data Sample Cleaned
Figure 4: Sample screenshot of transformed UN dataset.

3. Load, Clean, and Inspect Data

Load Data

Figure 5: Structure of country_codes_df
variable class first_values
Country.or.Area character Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code character AD, AE, AF, AG, AI, AL
ISO.alpha3.code character AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8
Figure 6: Structure of air_pollution_df
variable class first_values
Entity character Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
Code character AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Air.pollution..total...deaths.per.100.000. double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243
Indoor.air.pollution..deaths.per.100.000. double 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878
Outdoor.particulate.matter..deaths.per.100.000. double 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974
Outdoor.ozone.pollution..deaths.per.100.000. double 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707
Figure 7: Structure of gdp_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
Country.Name character Aruba, Afghanistan, Angola, Albania, Andorra, Arab World
Country.Code character ABW, AFG, AGO, ALB, AND, ARB
Indicator.Name character GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$)
Indicator.Code character NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD
X1960 double NA, 537777811.111111, NA, NA, NA, NA
X1961 double NA, 548888895.555556, NA, NA, NA, NA
X1962 double NA, 546666677.777778, NA, NA, NA, NA
X1963 double NA, 751111191.111111, NA, NA, NA, NA
X1964 double NA, 800000044.444444, NA, NA, NA, NA
X1965 double NA, 1006666637.77778, NA, NA, NA, NA
Figure 8: Structure of population_region_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
SDGRegion character SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA
SubRegion character Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa
Country character Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya
Notes integer NA, NA, NA, NA, NA, NA
Country.code integer 108, 174, 262, 232, 231, 404
Type character Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area
Parent.code integer 910, 910, 910, 910, 910, 910
X1950 character 2 309, 159, 62, 822, 18 128, 6 077
X1951 character 2 360, 163, 63, 835, 18 467, 6 242
X1952 character 2 406, 167, 65, 849, 18 820, 6 416
Figure 9: Structure of world, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 character FJ, TZ, EH, CA, US, KZ
name_long character Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan
continent character Oceania, Africa, Africa, North America, North America, Asia
region_un character Oceania, Africa, Africa, Americas, Americas, Asia
subregion character Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia
type character Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country
area_km2 double 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781
pop double 885806, 52234869, NA, 35535348, 318622525, 17288285
lifeExp double 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62
gdpPercap double 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466

Clean Data

First, we need to drop unnecessary columns and set the datatypes (factor, num, etc.).

Clean air_pollution_df:

Figure 10: Structure of air_pollution_df_cleaned
variable class first_values
Country integer Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
ISO.alpha3.code integer AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Deaths.Air.Pollution.per.100k double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243

Clean gdp_df:

Figure 11: Structure of gdp_df_cleaned
variable class first_values
Country integer Aruba, Aruba, Aruba, Aruba, Aruba, Aruba
ISO.alpha3.code integer ABW, ABW, ABW, ABW, ABW, ABW
Year integer 1986, 1987, 1988, 1989, 1990, 1991
GDP.USD double 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799
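Going from the wide layout in Figure 7 (one `X<year>` column per year) to the long layout in Figure 11 (one row per country and year) is a reshape. The report does not show the code used, so the following is only a sketch in base R on two rows copied from Figure 7; the variable names `gdp_wide` and `gdp_long` are illustrative.

```r
# Two rows and two year columns copied from Figure 7.
gdp_wide <- data.frame(
  Country.Name = c("Aruba", "Afghanistan"),
  Country.Code = c("ABW", "AFG"),
  X1960 = c(NA, 537777811.111111),
  X1961 = c(NA, 548888895.555556)
)

# Every column named like X1960 holds one year of GDP values.
year_cols <- grep("^X[0-9]{4}$", names(gdp_wide), value = TRUE)

# Stack the year columns into a single GDP.USD column with a Year key.
gdp_long <- reshape(
  gdp_wide,
  direction = "long",
  varying   = year_cols,
  v.names   = "GDP.USD",
  timevar   = "Year",
  times     = as.integer(sub("^X", "", year_cols)),
  idvar     = "Country.Code"
)

# Drop country-years without a GDP value and use the report's column names.
gdp_long <- gdp_long[!is.na(gdp_long$GDP.USD), ]
names(gdp_long)[names(gdp_long) == "Country.Name"] <- "Country"
names(gdp_long)[names(gdp_long) == "Country.Code"] <- "ISO.alpha3.code"
rownames(gdp_long) <- NULL
print(gdp_long)
```

The same result can be obtained with `tidyr::pivot_longer()` from the tidyverse; base R is used here only to keep the sketch dependency-free.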

Clean population_region_df:

Figure 12: Structure of population_region_df_cleaned
variable class first_values
SDGRegion integer SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA
SubRegion integer EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica
Country integer Burundi, Burundi, Burundi, Burundi, Burundi, Burundi
M49.code integer 108, 108, 108, 108, 108, 108
Year integer 1950, 1951, 1952, 1953, 1954, 1955
Population.thousands double 2309, 2360, 2406, 2449, 2492, 2537
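The raw population columns in Figure 8 are character strings with a space as the thousands separator (for example "2 309"), while Figure 12 shows them as numbers. One way to make that conversion is sketched below; `parse_un_number` is a hypothetical helper name, and it assumes the values are non-negative.

```r
# Hypothetical helper: strip everything except digits and the decimal point,
# then convert to numeric. Assumes non-negative values (a minus sign would be lost).
parse_un_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

# Values copied from the X1950 column in Figure 8.
parse_un_number(c("2 309", "159", "62", "822", "18 128", "6 077"))
```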

Clean country_codes_df:

Figure 13: Structure of country_codes_df
variable class first_values
Country.or.Area integer Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code integer AD, AE, AF, AG, AI, AL
ISO.alpha3.code integer AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8

Clean world:

Figure 14: Structure of world_df_cleaned, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 integer FJ, TZ, EH, CA, US, KZ

Note that we only have geometries for 175 countries, so some countries cannot be plotted on a map, but that is okay.

Final DataFrame Construction

Now let’s merge our datasets into one using a series of inner joins, with country code and year as keys depending on the specific join. We are using inner joins because we want to drop all rows that would otherwise contain null values, which occur when a country does not have a matching country code or when a dataset covers more years than our smallest year range (the air pollution dataset, 1990 to 2017).
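To illustrate what an inner join on country code and year keeps and drops, here is a small sketch with base R's `merge()`, which performs an inner join by default. The rows are copied from Figures 6, 7, and 15; the data frame names `air` and `gdp` are illustrative and the report's actual join code may differ.

```r
# Air pollution rows (1990 onward only).
air <- data.frame(
  ISO.alpha3.code = c("AFG", "AFG", "AND"),
  Year = c(1990L, 1991L, 1990L),
  Deaths.Air.Pollution.per.100k = c(299.477308883281, 291.277966734046,
                                    29.0238806202567)
)

# GDP rows, including a year (1960) that the air pollution data does not cover.
gdp <- data.frame(
  ISO.alpha3.code = c("AFG", "AND", "AND"),
  Year = c(1960L, 1990L, 2012L),
  GDP.USD = c(537777811.111111, 1029048481.88051, 3188808942.56713)
)

# Inner join: only country-years present in BOTH tables survive.
joined <- merge(air, gdp, by = c("ISO.alpha3.code", "Year"))
print(joined)
```

Only the Andorra 1990 row appears in both inputs, so it is the only row in `joined`; every unmatched country-year is dropped rather than filled with NA.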

Figure 15: Structure of final_df
variable class first_values
ISO.alpha2.code integer AD, AD, AD, AD, AD, AD
M49.code integer 20, 20, 20, 20, 20, 20
Year integer 2012, 2013, 1990, 1991, 1992, 1993
ISO.alpha3.code integer AND, AND, AND, AND, AND, AND
Country.x integer Andorra, Andorra, Andorra, Andorra, Andorra, Andorra
Deaths.Air.Pollution.per.100k double 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189
GDP.USD double 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065
SDGRegion integer EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE
SubRegion integer SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope
Population.thousands double 82, 81, 55, 57, 59, 61
geom list list(), list(), list(), list(), list(), list()
gdp.per.capita double 38887913.9337455, 39428448.6815589, 18709972.3978275, 19419799.6994086, 20508705.9640192, 16508618.9344369
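A quick unit check on the first row of Figure 15: the gdp.per.capita value matches GDP.USD divided by Population.thousands, which would make it USD per thousand people rather than USD per person. The sketch below reproduces that arithmetic. The per-person variant is our own suggestion, not something the report computes.

```r
# First row of Figure 15 (Andorra, 2012).
gdp_usd <- 3188808942.56713
population_thousands <- 82

# Reproduces the gdp.per.capita value shown in Figure 15.
per_thousand_people <- gdp_usd / population_thousands
print(per_thousand_people)

# Assumed alternative: convert the population to persons first.
per_person <- gdp_usd / (population_thousands * 1000)
print(per_person)
```

Rescaling a predictor by a constant only shifts the intercept of a log-log regression (since log(x / 1000) = log(x) - log(1000)), so the slope and R2 of a fit on log(gdp.per.capita) would not change, but the raw values should be read with this scale in mind.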

Our dataset is finally ready to be analyzed.

4. EDA - Exploratory Data Analysis

Quick Plots

Let’s start our EDA process by just looking at some quick plots to look at the distribution of data.

Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita

Figure 16: Histogram of Air Pollution Induced Deaths per 100k.
Figure 17: Histogram of Population.
Figure 18: Histogram of GDP per Capita.

It looks like Deaths.Air.Pollution.per.100k, Population.thousands, and gdp.per.capita are not normally distributed and are all right-skewed.

Boxplot of Air Pollution Induced Deaths, Population, and GDP per Capita

Let’s look at a boxplot for the outliers.

Figure 19: Boxplot of Deaths per 100,000 from Air Pollution vs SDG Region

It is interesting to note that Australia/New Zealand, Europe, and North America seem to have the lowest deaths per 100k from air pollution and are all fairly tightly clustered (low variance) relative to other regions around the world. Furthermore, these regions contain the most advanced countries.

Let’s take another look but at SubRegions.

Figure 20: Boxplot of Deaths per 100k from Air Pollution vs Sub Region

Separating the data into an even more granular grouping of regions shows some trends: Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k with low variance. Historically, these regions consist of countries that were already considered ‘First World’ before 1990, our first year of analysis. We will dig into this more later in our SMART questions.

What does the GDP per capita of these regions look like comparatively? Let’s take a look.

Figure 21: Boxplot of GDP per Capita vs Sub Region

Interesting to observe that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively. We will try to see if we can quantify this relationship later on in our main research analysis.

Map of Countries

Due to the nature of our data set, plotting maps and maps with intensities can add another dimension to how we visualize and therefore understand our data.

Figure 22: Global Map of SDGRegions and SubRegions
Figure 23: Global Intensity Map of Key Numerical Features, 1990 to 2017

Looks like some inverse correlation between gdp.per.capita and deaths.air.pollution.per.100k.

We can also use ggplot2 to have a bit more control over map plotting.

Figure 24: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
Figure 25: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017

SMART Questions

1. Is there a relationship between population size and Deaths per 100,000 due to air pollution?

Below, we measure the relationship between population size (in thousands) and deaths per 100,000 due to air pollution. Since these variables are numerical, we first check whether both are normally distributed and then compute their correlation. From the results below, the correlation between a country’s population size and its deaths due to air pollution is close to zero (0.069), so we see essentially no linear relationship. We do observe a moderate negative correlation (-0.543) between deaths due to air pollution and GDP per capita.

## 'data.frame':    5197 obs. of  12 variables:
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Deaths.Air.Pollution.per.100k: num  17.7 17.2 29 28.7 28.5 ...
##  $ GDP.USD                      : num  3188808943 3193704343 1029048482 1106928583 1210013652 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
##  $ Population.thousands         : num  82 81 55 57 59 61 63 64 64 64 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 5197; first list element:  list()
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ gdp.per.capita               : num  38887914 39428449 18709972 19419800 20508706 ...

## [1] 0.037
##                               Deaths.Air.Pollution.per.100k
## Deaths.Air.Pollution.per.100k                         1.000
## Population.thousands                                  0.069
## gdp.per.capita                                       -0.543
##                               Population.thousands gdp.per.capita
## Deaths.Air.Pollution.per.100k                0.069         -0.543
## Population.thousands                         1.000         -0.040
## gdp.per.capita                              -0.040          1.000
Table
Variable | Deaths.Air.Pollution.per.100k | Population.thousands | gdp.per.capita
Deaths.Air.Pollution.per.100k | 1.000 | 0.069 | -0.543
Population.thousands | 0.069 | 1.000 | -0.040
gdp.per.capita | -0.543 | -0.040 | 1.000
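A correlation matrix like the one above can be produced with `cor()`. The sketch below uses synthetic data (not the report's dataset) with the same column names, and also computes Spearman's rank correlation, which is a common complement when variables are right-skewed as ours are.

```r
set.seed(42)
n <- 200

# Synthetic, right-skewed stand-ins for the real columns (illustration only).
toy <- data.frame(
  gdp.per.capita       = exp(rnorm(n, mean = 9, sd = 1)),
  Population.thousands = exp(rnorm(n, mean = 8, sd = 1.5))
)
toy$Deaths.Air.Pollution.per.100k <-
  exp(8 - 0.4 * log(toy$gdp.per.capita) + rnorm(n, sd = 0.3))

cols <- c("Deaths.Air.Pollution.per.100k", "Population.thousands",
          "gdp.per.capita")

# Pearson measures linear association; Spearman measures monotonic association.
pearson  <- round(cor(toy[, cols], method = "pearson"), 3)
spearman <- round(cor(toy[, cols], method = "spearman"), 3)
print(pearson)
print(spearman)
```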

3. Which regions have the lowest and highest deaths due to air pollution?

## tibble [252 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year     : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ total    : num [1:252] 50.5 49.1 48.9 47.3 45.9 ...
##  - attr(*, "groups")= tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 2 3 4 5 6 7 8 9
##   ..$ .rows    : list<int> [1:9] 
##   .. ..$ : int [1:28] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:28] 29 30 31 32 33 34 35 36 37 38 ...
##   .. ..$ : int [1:28] 57 58 59 60 61 62 63 64 65 66 ...
##   .. ..$ : int [1:28] 85 86 87 88 89 90 91 92 93 94 ...
##   .. ..$ : int [1:28] 113 114 115 116 117 118 119 120 121 122 ...
##   .. ..$ : int [1:28] 141 142 143 144 145 146 147 148 149 150 ...
##   .. ..$ : int [1:28] 169 170 171 172 173 174 175 176 177 178 ...
##   .. ..$ : int [1:28] 197 198 199 200 201 202 203 204 205 206 ...
##   .. ..$ : int [1:28] 225 226 227 228 229 230 231 232 233 234 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
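The structure above is a summary with one value per SDGRegion and Year (9 regions x 28 years = 252 rows). The report does not show how `total` was computed, so the sketch below assumes a mean per group and uses a small made-up data frame. It then ranks regions by their overall mean to answer the lowest and highest question.

```r
# Made-up rows for illustration only (two regions, two years, two countries each).
toy <- data.frame(
  SDGRegion = rep(c("EUROPE", "SUB-SAHARAN AFRICA"), each = 4),
  Year = rep(c(1990, 1990, 1991, 1991), times = 2),
  Deaths.Air.Pollution.per.100k = c(30, 40, 28, 38, 200, 220, 190, 210)
)

# One value per region and year (assumption: the mean across countries).
by_region_year <- aggregate(Deaths.Air.Pollution.per.100k ~ SDGRegion + Year,
                            data = toy, FUN = mean)

# One value per region over all years, sorted from lowest to highest.
by_region <- aggregate(Deaths.Air.Pollution.per.100k ~ SDGRegion,
                       data = toy, FUN = mean)
by_region <- by_region[order(by_region$Deaths.Air.Pollution.per.100k), ]
print(by_region)
```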

4. How do deaths due to air pollution increase over time? More specifically, are death rates in the most recent X years higher than death rates in earlier groups of X years?

5. Main Research Question

Do lower GDP countries have more deaths per 100k due to air pollution?

Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?

Linear Fit

Let’s first look at the general fit on the overall data.

Figure XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.

From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the relationship is not particularly strong, as the R2 is low at 0.295. This means that only about 30% of the variance in deaths due to air pollution per 100k is explained by GDP per capita in a linear relationship.
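As a consistency check, in a linear regression with a single predictor the R2 equals the square of the Pearson correlation. The correlation of -0.543 between deaths and GDP per capita in the correlation matrix from SMART question 1 squares to roughly the same 0.295. The second half of the sketch verifies the identity on synthetic data.

```r
# Correlation between Deaths.Air.Pollution.per.100k and gdp.per.capita
# as reported in the correlation matrix.
r <- -0.543
print(round(r^2, 3))

# The same identity on synthetic data (illustration only).
set.seed(3)
x <- rnorm(100)
y <- 2 - 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```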

Even looking at each individual SDGRegion, their linear fits get better overall but are still not particularly strong with the highest being Australia/New Zealand and Europe at R2 of 0.56 and 0.55 respectively.

Figure XX: Linear models for each SDGRegion, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.

Let’s now look at how slicing by annual changes plays a part.

Figure XX: Linear models for each Year, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.

As observed, time does not seem to play a significant part in describing the relationship between deaths due to air pollution per 100k vs GDP per capita as the R2 stays roughly constant around 0.3 across all the years.

Transformed Log Scale - Linear Fit

Perhaps we should look at a non-linear fit. From our visuals, we see that every plot starts off at very high deaths due to air pollution per 100k and then drops off dramatically as GDP per capita increases. However, the drop-off begins to taper off and asymptotically approaches some value. (It will be interesting to see if we can generalize what that GDP per capita value is. Let’s table that for later.) We have seen this type of behavior before in log graphs such as the one shown below.

Sample Log Graph
Figure XX: Sample log graph.

Our data seems to follow a -log(x) shape rather than log(x). Let’s wrap both features in the log() function, fit a linear model to the transformed data, and see what the relationship is.

fit2’s summary statistics are:

## 
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita), 
##     data = final_df_sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.099 -0.235  0.000  0.206  1.431 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)         10.07849    0.04871     207 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952    0.00323    -121 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.737 
## F-statistic: 1.45e+04 on 1 and 5195 DF,  p-value: <0.0000000000000002
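To show why a log-log transform helps for this kind of curve, the sketch below generates synthetic data from a power law (a straight line on log-log axes) and fits both a plain linear model and a log-log model. The data and the coefficients 10 and -0.4 are invented for the illustration and are not the report's dataset.

```r
set.seed(1)
n <- 500

# Synthetic power-law data: log(y) = 10 - 0.4 * log(x) + noise.
x <- exp(runif(n, min = 6, max = 12))
y <- exp(10 - 0.4 * log(x) + rnorm(n, sd = 0.3))

fit_linear <- lm(y ~ x)            # straight line on the raw scale
fit_loglog <- lm(log(y) ~ log(x))  # straight line on the log-log scale

# The log-log slope estimates the power-law exponent.
print(coef(fit_loglog))

# Compare how much variance each model explains on its own scale.
print(c(linear = summary(fit_linear)$r.squared,
        loglog = summary(fit_loglog)$r.squared))
```

In a log-log model, the slope is interpreted as an elasticity: a 1% increase in x is associated with approximately a slope-percent change in y.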

Let’s replot with this new fit.

Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship by SDGRegion.
Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship by years.
Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.

Across the board, the strength of our linear relationship increases dramatically when both features are first transformed with the log() function. The new R2 is 0.737, which means around 74% of the variance in our (log-transformed) target feature can be explained by this relationship.

Let’s test a few more regression models by adding more features and see what happens.

fit3 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion, data=final_df_sf)
fit4 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year, data=final_df_sf)
fit5 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year, data=final_df_sf)

The R2 values for fit3, fit4, and fit5, which add more features, are 0.883, 0.887, and 0.747 respectively.

Let’s check out the VIFs to see if we should keep any of our new models.

Figure XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion) (fit3).
Term | VIF
log(gdp.per.capita) | 975

SubRegion level | VIF of log(gdp.per.capita):SubRegion interaction | VIF of SubRegion main effect
Caribbean | 7945 | 6747
CentralAmerica | 5044 | 3902
CentralAsia | 3147 | 2131
EasternAfrica | 9101 | 5646
EasternAsia | 2492 | 2101
EasternEurope | 5911 | 4744
Melanesia | 2988 | 2281
Micronesia | 2640 | 2106
MiddleAfrica | 5154 | 3457
NorthernAfrica | 3892 | 2936
NORTHERNAMERICA | 3611 | 3714
NorthernEurope | 5909 | 5880
Polynesia | 1944 | 1584
South-EasternAsia | 5987 | 4427
SouthAmerica | 7157 | 5680
SouthernAfrica | 3933 | 3037
SouthernAsia | 5166 | 3437
SouthernEurope | 7168 | 6334
WesternAfrica | 9144 | 5707
WesternAsia | 9611 | 8049
WesternEurope | 5771 | 5965
Figure XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year) (fit4). In the flattened output the Year labels did not line up with the values, so the 27 values near 2 are assigned to the 27 Year terms and the two blocks of 21 large values are labeled as in the fit3 table.
Term | VIF
log(gdp.per.capita) | 987
Year1991 through Year2017 (in order) | 1.93, 1.94, 1.95, 1.96, 1.99, 1.99, 1.99, 1.99, 1.99, 2.02, 2.02, 2.05, 2.05, 2.06, 2.07, 2.08, 2.09, 2.11, 2.1, 2.11, 2.12, 2.12, 2.13, 2.13, 2.11, 2.1, 2.11

SubRegion level | VIF of log(gdp.per.capita):SubRegion interaction | VIF of SubRegion main effect
Caribbean | 8010 | 6800
CentralAmerica | 5074 | 3922
CentralAsia | 3170 | 2144
EasternAfrica | 9188 | 5695
EasternAsia | 2516 | 2121
EasternEurope | 5949 | 4771
Melanesia | 2997 | 2285
Micronesia | 2662 | 2122
MiddleAfrica | 5201 | 3486
NorthernAfrica | 3917 | 2952
NORTHERNAMERICA | 3613 | 3717
NorthernEurope | 5952 | 5922
Polynesia | 1951 | 1587
South-EasternAsia | 6051 | 4473
SouthAmerica | 7192 | 5704
SouthernAfrica | 3955 | 3051
SouthernAsia | 5204 | 3458
SouthernEurope | 7232 | 6390
WesternAfrica | 9195 | 5730
WesternAsia | 9704 | 8125
WesternEurope | 5774 | 5967
Figure XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year) (fit5).
Term | VIF
log(gdp.per.capita) | 34.7

Year level | VIF of log(gdp.per.capita):Year interaction | VIF of Year main effect
1991 | 187 | 187
1992 | 179 | 180
1993 | 181 | 181
1994 | 177 | 177
1995 | 183 | 184
1996 | 185 | 187
1997 | 186 | 189
1998 | 185 | 187
1999 | 183 | 185
2000 | 185 | 187
2001 | 186 | 188
2002 | 187 | 190
2003 | 187 | 192
2004 | 189 | 196
2005 | 191 | 200
2006 | 194 | 205
2007 | 197 | 210
2008 | 202 | 217
2009 | 209 | 223
2010 | 213 | 228
2011 | 216 | 232
2012 | 219 | 236
2013 | 221 | 239
2014 | 223 | 240
2015 | 224 | 240
2016 | 222 | 238
2017 | 222 | 238

Although adding more features to our regression model results in higher R2 values, the Variance Inflation Factors (VIFs) in each of these models are extremely high (far above the common rule-of-thumb thresholds of 5 to 10), which indicates that the added terms are highly correlated with each other. We therefore reject those models and stick with our second model, fit2.
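For reference, the VIF of a model term is 1 / (1 - R2), where R2 comes from regressing that term on all the other terms in the model. The sketch below computes it by hand on synthetic data with two nearly identical predictors. The function name `vif_by_hand` is our own, and the report's tables were presumably produced with a package function instead.

```r
# Hypothetical helper: VIF for each non-intercept column of the model matrix.
vif_by_hand <- function(model) {
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  sapply(colnames(X), function(term) {
    others <- X[, colnames(X) != term, drop = FALSE]
    # R2 from regressing this term on all remaining terms.
    r2 <- summary(lm(X[, term] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Synthetic data: x2 is almost a copy of x1, x3 is independent.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

vifs <- vif_by_hand(lm(y ~ x1 + x2 + x3))
print(round(vifs, 1))
```

The two collinear predictors get large VIFs while the independent one stays near 1, which is the pattern used above to reject fit3, fit4, and fit5.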

We can then predict a country’s deaths per 100k caused by air pollution in a given year from the country’s GDP per capita with the following equation. Note that log() in R is the natural logarithm, written \(\ln\) here:

\[ \ln(Deaths_{from~air~pollution|per~100k|per~year|per~country}) = 10.07849 - 0.38952 * \ln(GDP_{per~capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]

or solving for our target variable:

\[ Deaths_{from~air~pollution|per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]
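The fitted relationship can be wrapped in a small function. Because fit2 was fit with R's log(), which is the natural logarithm, the inverse transform is exp(). The function name below is our own, and the input is assumed to be on the same scale as the gdp.per.capita column used to fit the model.

```r
# Coefficients from the fit2 summary above.
intercept <- 10.07849
slope     <- -0.38952

# Hypothetical helper: predicted deaths per 100k for a given gdp.per.capita value
# (same scale as the gdp.per.capita column in final_df).
predict_deaths_per_100k <- function(gdp_per_capita) {
  exp(intercept + slope * log(gdp_per_capita))
}

# Predictions fall as gdp.per.capita rises, since the slope is negative.
print(predict_deaths_per_100k(c(1e6, 1e7, 1e8)))
```

The same equation can be written as a power law, exp(10.07849) * gdp.per.capita^(-0.38952), which makes the diminishing effect of additional GDP per capita explicit.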

Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?

We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test whether the mean deaths caused by air pollution per 100k are equal across different GDP per capita levels.

One-Way ANOVA Test

We start off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.

H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp

H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal

We will use an \(\alpha\) value of 0.05.

The p-value of the ANOVA test is reported as 0e+00 (too small to be distinguished from zero at the printed precision), which is lower than our \(\alpha\) of 0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means there is statistically significant evidence that at least one of the mean death rates across the lowest, low, medium, and high GDP per capita groups differs from the others.
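The report does not show how the four GDP per capita levels were defined. The sketch below assumes quartile cut points and uses synthetic data to show the mechanics: bin with `cut()`, then run a one-way ANOVA with `aov()`.

```r
# Synthetic data for illustration only.
set.seed(7)
n <- 400
gdp    <- exp(runif(n, min = 6, max = 12))
deaths <- exp(10 - 0.4 * log(gdp) + rnorm(n, sd = 0.3))

# Assumption: the four levels are the quartiles of GDP per capita.
breaks <- quantile(gdp, probs = seq(0, 1, by = 0.25))
gdp_level <- cut(gdp, breaks = breaks,
                 labels = c("lowest", "low", "medium", "high"),
                 include.lowest = TRUE)

# One-way ANOVA: are the group means of deaths equal across levels?
anova_fit <- aov(deaths ~ gdp_level)
p_value <- summary(anova_fit)[[1]][["Pr(>F)"]][1]
print(summary(anova_fit))
print(p_value < 0.05)
```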

2-Sample T-Tests

We will conduct six two-sample t-tests to determine whether each pair of groupings differs from each other:

  • Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
  • Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
  • Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_medium_gdp
  • Low GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_high_gdp

We will use a two sample t-test for each and use an \(\alpha\) value of 0.05.

Test 1:

p-valuetest1: 2.99e-203

p-valuetest1 < \(\alpha\)0.05 = TRUE

Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.

Test 2:

p-valuetest2: 1.47e-13

p-valuetest2 < \(\alpha\)0.05 = TRUE

Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 3:

p-valuetest3: 0e+00

p-valuetest3 < \(\alpha\)0.05 = TRUE

Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 4:

p-valuetest4: 2.91e-06

p-valuetest4 < \(\alpha\)0.05 = TRUE

Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 5:

p-valuetest5: 4.79e-48

p-valuetest5 < \(\alpha\)0.05 = TRUE

Conclusion of test5: p-valuetest5 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 6:

p-valuetest6: 4.17e-70

p-valuetest6 < \(\alpha\)0.05 = TRUE

Conclusion of test6: p-valuetest6 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
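The six comparisons above can also be run in a single call with `pairwise.t.test()`, which additionally lets us adjust the p-values for multiple comparisons. The Bonferroni adjustment shown here is our suggestion and was not part of the tests above. The data are synthetic and the quartile grouping is an assumption.

```r
# Synthetic data for illustration only.
set.seed(7)
n <- 400
gdp    <- exp(runif(n, min = 6, max = 12))
deaths <- exp(10 - 0.4 * log(gdp) + rnorm(n, sd = 0.3))
gdp_level <- cut(gdp, breaks = quantile(gdp, probs = seq(0, 1, by = 0.25)),
                 labels = c("lowest", "low", "medium", "high"),
                 include.lowest = TRUE)

# All pairwise two-sample t-tests. pool.sd = FALSE runs a separate t.test()
# for each pair (Welch's test by default), with Bonferroni-adjusted p-values.
pairwise <- pairwise.t.test(deaths, gdp_level,
                            p.adjust.method = "bonferroni",
                            pool.sd = FALSE)
print(pairwise$p.value)
```

The result is a lower-triangular matrix with one adjusted p-value for each of the six pairs of groups.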

6. Conclusion

Main Research Results

From all of our tests, we can confirm that the mean deaths caused by air pollution differ significantly between the different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution have a significant relationship with GDP per capita, and the relationship can be quantified by Equation 2, where \(\ln\) is the natural logarithm as computed by R’s log():

\[ Deaths_{from~air~pollution|per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]

The strength of the correlation can be quantified by our R2 value of 0.737 from Figure XX.

Areas of Further Analysis

This data set has many avenues for further statistical analysis and modeling. Some potential areas for further analysis include:

  • Can we quantify, or build a mathematical model of, the GDP per capita value at which deaths per 100k level off toward the asymptote we observed in the log data set?
  • Can we build a better performing predictor for Deaths per 100k due to Air Pollution using more powerful models (Random Forests, Gradient Boosting, SVMs) and/or by including more features?

7. Bibliography

Figure X: References
Number APA Citation
1 Robin Lovelace, J. N. (n.d.). Chapter 8 Making maps with R: Geocomputation with R. Retrieved October 28, 2021, from https://geocompr.robinlovelace.net/adv-map.html
2 Robin Lovelace, J. N. (2021, October 28). Chapter 2 Geographic data in R: Geocomputation with R. Retrieved from https://geocompr.robinlovelace.net/spatial-class.html#intro-sf
3 Hadley Wickham, D. N. (2021, October 28). 6 Maps. Retrieved from https://ggplot2-book.org/maps.html
4 Customizing ggplot2 color and fill scales. (2021, October 28). Retrieved from https://spielmanlab.github.io/introverse/articles/color_fill_scales.html
5 Logarithmic Functions. (2021, October 28). Retrieved from https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html